Refactor XPath evaluation and optimize memory allocation #1
Merged
Memory leak fixes:
- StreamingParser: buffer, events, and complete_elements vecs never
shrank after drain operations, causing unbounded growth in long-lived
parsers. Added shrink_to() calls matching StreamingSaxParser behavior.
- StructuralIndex: shrink_to_fit() existed but was never called after
building. Initial capacity estimates over-allocate by 2-3x; now
reclaimed immediately after build_children_from_parents().
- DocumentAccumulator: reduced default pre-allocation from 64KB to 4KB.
The old 64KB multiplied quickly across concurrent accumulators.
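The drain-then-shrink pattern described above can be sketched as follows. This is an illustrative toy, not the crate's actual API: the names `StreamingParser`, `drain_events`, and `SHRINK_THRESHOLD` are hypothetical, and the 4x threshold follows the reallocation-churn guard mentioned in a later commit.

```rust
// Only shrink when capacity exceeds 4x the live length, so that a
// drain/refill cycle does not reallocate on every call.
const SHRINK_THRESHOLD: usize = 4;

struct StreamingParser {
    events: Vec<String>,
}

impl StreamingParser {
    fn drain_events(&mut self, n: usize) -> Vec<String> {
        let take = n.min(self.events.len());
        let drained: Vec<String> = self.events.drain(..take).collect();
        // Reclaim excess capacity left behind by the drain, but only past
        // the threshold, leaving 2x headroom for the next batch.
        if self.events.capacity() > self.events.len().max(1) * SHRINK_THRESHOLD {
            self.events.shrink_to(self.events.len() * 2);
        }
        drained
    }
}

fn main() {
    let mut p = StreamingParser {
        events: (0..1000).map(|i| i.to_string()).collect(),
    };
    let got = p.drain_events(990);
    assert_eq!(got.len(), 990);
    assert_eq!(p.events.len(), 10);
    // Capacity was reclaimed instead of staying pinned at ~1000.
    assert!(p.events.capacity() < 1000);
}
```

Without the threshold, `shrink_to` after every drain would trade the leak for constant reallocation; with it, capacity is only released once it drifts well past what the parser actually holds.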
Contention fix:
- XPath cache: replaced deep CompiledExpr cloning with Arc<CompiledExpr>.
Every cache hit previously cloned all Vec<Op>, Strings, and
Box<CompiledExpr> recursively. Now it's a cheap Arc pointer bump.
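The cache change amounts to the following pattern, shown here as a minimal sketch with hypothetical names (`XPathCache`, `get_or_compile`); the real `CompiledExpr` holds ops vectors, strings, and boxed subexpressions, all of which a deep clone would copy.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative stand-in for the crate's compiled-expression type.
struct CompiledExpr {
    ops: Vec<String>,
}

struct XPathCache {
    entries: HashMap<String, Arc<CompiledExpr>>,
}

impl XPathCache {
    fn get_or_compile(&mut self, expr: &str) -> Arc<CompiledExpr> {
        if let Some(cached) = self.entries.get(expr) {
            // Cache hit: a reference-count bump, not a recursive deep clone.
            return Arc::clone(cached);
        }
        let compiled = Arc::new(CompiledExpr { ops: vec![expr.to_string()] });
        self.entries.insert(expr.to_string(), Arc::clone(&compiled));
        compiled
    }
}

fn main() {
    let mut cache = XPathCache { entries: HashMap::new() };
    let a = cache.get_or_compile("/root/item");
    let b = cache.get_or_compile("/root/item");
    // Both handles point at the same compilation; nothing was re-cloned.
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(a.ops.len(), 1);
}
```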
Correctness + memory fix:
- compare_values in XPath eval: was using format!("{}", node_id) to
compare nodes, which compared raw u32 IDs as strings — both wrong
per XPath 1.0 spec and wasteful (O(n*m) String allocations). Now
uses actual document text content via node_string_value().
- Consolidated duplicated get_node_text_content/collect_text_content
from lib.rs into shared dom::node_string_value().
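For reference, the XPath 1.0 string-value of an element is the concatenation of all descendant text nodes in document order. The sketch below shows that semantic on a toy recursive `Node` type; the crate's real DOM works over indexed node IDs rather than an owned tree.

```rust
// Toy node type for illustration only.
enum Node {
    Element { children: Vec<Node> },
    Text(String),
}

// String-value per XPath 1.0: concatenation of descendant text,
// in document order.
fn node_string_value(node: &Node) -> String {
    match node {
        Node::Text(s) => s.clone(),
        Node::Element { children } => children.iter().map(node_string_value).collect(),
    }
}

fn main() {
    let doc = Node::Element {
        children: vec![
            Node::Text("Hello, ".into()),
            Node::Element { children: vec![Node::Text("world".into())] },
            Node::Text("!".into()),
        ],
    };
    assert_eq!(node_string_value(&doc), "Hello, world!");
}
```

Comparing these string-values is what the spec requires; comparing formatted node IDs, as before, produced answers unrelated to document content.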
Safety fix:
- empty_binary: replaced .unwrap() on OwnedBinary::new(0) with match
to avoid potential NIF panic.
Best practices:
- Added #[must_use] to evaluate(), evaluate_from_node(), compile(),
validate_strict(), and XPathValue type.
https://claude.ai/code/session_015igpdCrNYKuoPrHWZ5RXYc
…e all clippy warnings
- Fix XPathValue::to_string_value() to return empty string for NodeSets
instead of misleading format!("[node:{}]", id) — callers with document
access now use dom::node_string_value() per XPath 1.0 spec
- Add resolve_string() helper to XPath functions for proper NodeSet-to-string
conversion; pass document access to all 8 string functions
- Replace duplicated get_string_value/collect_text with dom::node_string_value
- Remove residual get_node_text_content wrapper from NIF layer
- Optimize NodeSet-vs-NodeSet comparison from O(n*m) to O(n+m) by
pre-computing right-side string values
- Fix streaming parser shrink_to() reallocation churn with 4x threshold
- Remove redundant #[must_use] on compile() (Result already has it)
- Remove dead inherent methods on XmlDocument duplicated by DocumentAccess trait
- Remove unused XmlAttribute re-export from dom/mod.rs
- Fix len_zero clippy warnings in unified_scanner tests
- Add #[expect(clippy::ptr_arg)] to intern_cow (needs Cow for zero-copy optimization)
- Add 4 correctness tests for NodeSet equality and string function semantics
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
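The NodeSet-vs-NodeSet optimization listed above can be sketched like this. Per XPath 1.0, two node-sets compare equal if any pair of nodes has equal string-values; pre-hashing the right side replaces the naive nested loop with one pass over each set. The inputs here are already-resolved string values; in the crate, a document-backed lookup would produce them first.

```rust
use std::collections::HashSet;

// O(n + m) node-set equality: hash the right side's string-values once,
// then probe with each left-side value, instead of comparing all pairs.
fn node_sets_equal(left: &[String], right: &[String]) -> bool {
    let right_values: HashSet<&str> = right.iter().map(String::as_str).collect();
    left.iter().any(|v| right_values.contains(v.as_str()))
}

fn main() {
    let left = vec!["10".to_string(), "42.5".to_string()];
    let right = vec!["7".to_string(), "42.5".to_string()];
    // "42.5" appears on both sides, so the sets compare equal.
    assert!(node_sets_equal(&left, &right));
    assert!(!node_sets_equal(&left, &["9".to_string()]));
}
```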
- Replace expect() in XPath parser peek() with match + defensive Eof fallback — eliminates BEAM VM crash risk if invariant is violated
- Replace expect() in XPath lexer read_string() with unwrap_or defensive fallback — same rationale
- Thread document access into compare_numbers() so relational operators (< <= > >=) properly resolve NodeSet text content before numeric conversion — fixes silent wrong results for expressions like /r/price > 10 where <price>42.5</price>
- Add resolve_number() helper mirroring resolve_string() pattern
- Add test for relational operators on NodeSets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
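The relational-operator fix can be sketched as below. Per XPath 1.0, a node-set operand in a relational comparison is true if any node's string-value, converted to a number, satisfies the relation; `to_number` and `node_set_gt` are hypothetical names, and the string-value lookup is stubbed as plain strings.

```rust
// XPath number() conversion: trim, parse, NaN on anything non-numeric.
fn to_number(s: &str) -> f64 {
    s.trim().parse::<f64>().unwrap_or(f64::NAN)
}

// NodeSet > number: true if ANY node's numeric value satisfies it.
// `node_values` stands in for resolved node text content.
fn node_set_gt(node_values: &[&str], rhs: f64) -> bool {
    node_values.iter().any(|v| to_number(v) > rhs)
}

fn main() {
    // Mirrors /r/price > 10 where <price>42.5</price>.
    assert!(node_set_gt(&["42.5"], 10.0));
    // Non-numeric text converts to NaN; NaN comparisons are false.
    assert!(!node_set_gt(&["abc"], 10.0));
}
```

Before the fix, the raw node ID rather than the text "42.5" reached the numeric conversion, so expressions like this silently returned wrong answers.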
Summary
This PR refactors XPath evaluation logic, consolidates duplicate code, and optimizes memory allocation patterns throughout the codebase. The changes improve correctness of XPath comparisons, reduce unnecessary cloning, and prevent unbounded memory growth in long-lived parsers and accumulators.
Key Changes
XPath Evaluation Improvements
- Consolidated get_node_text_content and collect_text_content logic into a shared dom::node_string_value function to eliminate duplication and ensure consistent XPath string-value semantics across the codebase
- Fixed compare_values to use actual node string-values instead of formatting raw node IDs as numbers, which was both semantically incorrect and wasteful
- #[must_use] attributes: Applied to evaluate, evaluate_from_node, compile, and XPathValue to catch unused results at compile time

Memory Optimization
- XPath cache now stores Arc<CompiledExpr> instead of cloning entire compiled expressions on cache hits, reducing allocations from deep clones to cheap pointer bumps
- Added shrink_to() calls after draining buffers to prevent unbounded capacity growth in long-lived parsers
- Shrank the events and complete_elements vectors after partial drains to release excess capacity
- Added a shrink_to_fit() call after building document indices to reclaim over-allocated capacity from initial size estimates

Error Handling
- Rewrote empty_binary to handle potential allocation failures gracefully instead of panicking in a BEAM NIF

Documentation
https://claude.ai/code/session_015igpdCrNYKuoPrHWZ5RXYc